This document explores a dataset of suicide rates from 1985 to 2016.
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
import plotly.express as px
%matplotlib inline
# load in the dataset into a pandas dataframe
df = pd.read_csv('master.csv')
df.head()
# data overview
print(df.shape)
print(df.dtypes)
df.describe()
df.isnull().sum()
df.drop(columns='HDI for year', inplace=True)
df.info()
df.sample(10)
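# sum only the numeric columns; text columns such as country cannot be meaningfully summed
yearlySum = df.groupby('year').sum(numeric_only=True)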
yearlySum
#removing 2016 because it's incomplete
df = df[df.year != 2016]
df.hist(figsize=(10,8));
sb.pairplot(df)
This document explores a dataset of suicide rates from 1985 to 2016. The dataset has 27,820 rows, one per combination of country, year, sex and age group, each recording the number of suicides for that combination. The variables cover the main suicide-related data: suicide count, country, population of the country, year, sex and age group, as well as the suicide rate per 100k population. The dataset contains both numeric and categorical variables.
I am interested in exploring the signals correlated with increased suicide rates around the world.
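The "one row per country, year, sex and age group" format described above can be checked directly. This is a minimal sketch, not part of the original analysis; the helper name `count_duplicate_keys` is my own, and it only assumes the column names already used in this notebook.

```python
import pandas as pd


def count_duplicate_keys(frame, keys):
    """Return how many rows repeat an earlier combination of `keys`.

    Hypothetical helper: a result of 0 means every combination of the
    key columns appears at most once in `frame`.
    """
    # duplicated() marks every row whose key combination was already seen
    return int(frame.duplicated(subset=keys).sum())
```

In this notebook it could be called as `count_duplicate_keys(df, ['country', 'year', 'sex', 'age'])`; a result of 0 would support the description of the row format.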
fig, ax = plt.subplots(nrows=3, figsize = [8,8])
variables = ['suicides/100k pop', 'age', 'sex']
for i in range(len(variables)):
    var = variables[i]
    ax[i].hist(data = df, x = var)
plt.show()
df['sex'].value_counts()
df['age'].value_counts()
It seems that there is an equal number of rows for each age group, each sex and the other variables. Thus, I will explore the data using the total number of suicides for each variable.
Let's investigate the distribution of total suicides across the variables.
age = df.loc[:,['age','suicides_no']]
age['suicides_sum'] = age.groupby(['age'])['suicides_no'].transform('sum')
age.drop('suicides_no', axis=1, inplace=True)
age = age.drop_duplicates()
fig=px.bar(age,x='age', y='suicides_sum', title='Suicide Totals By age',
category_orders={"age":['5-14 years', '15-24 years', '25-34 years', '35-54 years', '55-74 years','75+ years']})
fig.show()
Suicide totals are highest among middle-aged individuals. But does this group also have the highest suicide rate?
sex = df.loc[:,['sex','suicides_no']]
sex['suicides_sum'] = sex.groupby(['sex'])['suicides_no'].transform('sum')
sex.drop('suicides_no', axis=1, inplace=True)
sex = sex.drop_duplicates()
fig=px.pie(sex,names='sex', values="suicides_sum", title= 'Suicide Totals By Sex')
fig.show()
Male suicides account for about 3/4 of the total suicides across both sexes.
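The share read off the pie chart can also be computed as a number. This is a minimal sketch under the assumption that summing `suicides_no` per group is the right total; the helper name `share_of_total` is my own and not part of the original notebook.

```python
import pandas as pd


def share_of_total(frame, by, value='suicides_no'):
    """Return each group's fraction of the overall total of `value`.

    Hypothetical helper: groups `frame` by the column `by`, sums `value`
    per group, and divides by the grand total so the result sums to 1.
    """
    totals = frame.groupby(by)[value].sum()
    return totals / totals.sum()
```

In this notebook, `share_of_total(df, 'sex')` would give the male and female fractions behind the chart.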
year = df.loc[:,['year','suicides_no']]
year['suicides_sum'] = year.groupby(['year'])['suicides_no'].transform('sum')
year.drop('suicides_no', axis=1, inplace=True)
year = year.drop_duplicates()
fig=px.bar(year,x='year', y="suicides_sum", title= 'Suicide Totals Over The Years')
fig.show()
The largest number of total suicides per year was in 1999.
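Reading the tallest bar off a chart is easy to get wrong by a year, so the peak can also be looked up programmatically. A minimal sketch, with a helper name (`largest_group`) that I made up for illustration:

```python
import pandas as pd


def largest_group(frame, by, value='suicides_no'):
    """Return (group label, total) for the group with the largest total.

    Hypothetical helper: sums `value` within each group of `by` and
    picks the group whose sum is the maximum.
    """
    totals = frame.groupby(by)[value].sum()
    return totals.idxmax(), totals.max()
```

In this notebook, `largest_group(df, 'year')` would return the peak year and its total; the same call with `'country'` would do the same for the country totals.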
country = df.loc[:,['country','suicides_no']]
country = country.groupby('country')['suicides_no'].sum().reset_index()
country = country.sort_values('suicides_no')
country = country.tail(20)
fig = px.bar(country, x='suicides_no', y='country', title= 'Suicide Totals By country')
fig.show()
As the plot shows, Russia has the highest number of suicides.
In this section, I will investigate the relationship between the suicide rate and the other variables.
Let's first explore the suicide trend over the years.
plt.figure(figsize=(8,6))
plt.title('Global Suicide Trend Over Years', fontsize=16)
ys=sb.lineplot(data=df, x='year', y='suicides/100k pop')
ys.set(xlabel='year', ylabel='Suicides per 100k ');
- The suicide rate reached its peak in 1995 with a rate of 15.5.
- After 1995, the rate decreased slightly.
Now let's consider the other variables.
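By default `sb.lineplot` draws the mean of the y values at each x, so the line in the plot is the per-year mean of the row-level rates. A minimal sketch of the same aggregation in pandas, useful for reading off the peak; the helper name `peak_year` is my own:

```python
import pandas as pd


def peak_year(frame, rate_col='suicides/100k pop'):
    """Return (year, mean rate) for the year with the highest mean rate.

    Hypothetical helper: averages the row-level rates within each year,
    which is an unweighted mean over country/sex/age rows.
    """
    yearly = frame.groupby('year')[rate_col].mean()
    return yearly.idxmax(), yearly.max()
```

In this notebook, `peak_year(df)` would return the year and value to compare against the bullets above.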
plt.figure(figsize=(8,6))
plt.title('Global Suicide Rate By Age', fontsize=16)
base_color = sb.color_palette()[0]
age_order = ['5-14 years', '15-24 years', '25-34 years', '35-54 years', '55-74 years','75+ years']
chart=sb.barplot(data = df ,x = 'age',y = 'suicides/100k pop', ci = None, color = base_color, order= age_order)
chart.set_xticklabels(chart.get_xticklabels(), rotation=45);
- Unlike the suicide totals, the suicide rate is highest among older individuals.
- The suicide rate tends to increase with age.
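One caveat: the bar plot averages the row-level rates, so a small country counts as much as a large one. An alternative, which is not what the plot above does, is a pooled rate: total suicides divided by total population within each group. This is a minimal sketch; the helper name `pooled_rate` is my own, and the column name `population` is an assumption since it is not used elsewhere in this notebook.

```python
import pandas as pd


def pooled_rate(frame, by, count_col='suicides_no', pop_col='population'):
    """Return suicides per 100k population, pooled within each group.

    Hypothetical helper: `pop_col` is an assumed column name. Sums the
    counts and populations per group before dividing, so larger
    populations weigh more than in a plain mean of row-level rates.
    """
    sums = frame.groupby(by)[[count_col, pop_col]].sum()
    return sums[count_col] / sums[pop_col] * 100_000
```

In this notebook, `pooled_rate(df, 'age')` could be compared with the bar heights to see whether the age pattern holds under both definitions.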
plt.figure(figsize=(8,6))
plt.title('Global Suicide Rate By Sex', fontsize=16)
sx =sb.barplot(data = df ,x = 'sex',y = 'suicides/100k pop' ,ci = None)
sx.set(xlabel='Sex', ylabel='Suicides per 100k');
- The male suicide rate is about three times the female suicide rate.
To get the suicide rate for each country, I will first compute the mean.
df.groupby(['country'])['suicides/100k pop'].agg(['sum', 'size', 'mean'])
country = df.loc[:,['country','suicides/100k pop']]
country = country.groupby('country')['suicides/100k pop'].mean().reset_index()
country = country.sort_values('suicides/100k pop', ascending=False)
country = country.head(30)
plt.figure(figsize=(20,15))
plt.title('Global Suicide Rates By Country', fontsize=18)
base_color = sb.color_palette()[0]
chart=sb.barplot(data = country ,x = 'suicides/100k pop',y = 'country', color= base_color)
chart.set(xlabel='Suicides per 100k', ylabel='Country')
sb.despine(left=True, bottom=True);
Clearly, Lithuania has the highest suicide rate (41 suicides per 100k).
Here I will investigate the trend over the years for different variables.
g = sb.FacetGrid(data = df, hue = 'sex', height = 7)
g.map(plt.scatter, 'year','suicides/100k pop')
g.set(xscale = 'log')
x_ticks = [1985, 1990, 1995, 2000, 2005, 2010, 2015]
g.set(xticks = x_ticks, xticklabels = x_ticks)
plt.title('Global Suicide Rates Trend Over The Years (By Sex)', fontsize=16)
g.set(xlabel='Year', ylabel='Suicides per 100k')
plt.ylim([0,100])
g.add_legend();
- During the 80s, the rates for both males and females were low.
- From the mid-90s onward, the male rate increased.
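A scatter of every row makes the overall trend hard to judge, so a small table of mean rates per year and group can back up these observations. A minimal sketch, with a helper name (`rate_by_year_and_group`) that I made up for illustration:

```python
import pandas as pd


def rate_by_year_and_group(frame, group, rate_col='suicides/100k pop'):
    """Return a year-by-group table of mean row-level rates.

    Hypothetical helper: rows are years, columns are the values of
    `group` (for example 'sex' or 'age'), cells are unweighted means.
    """
    return frame.pivot_table(index='year', columns=group,
                             values=rate_col, aggfunc='mean')
```

In this notebook, `rate_by_year_and_group(df, 'sex')` gives one column per sex, and calling `.plot()` on the result would draw one line per group.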
g = sb.FacetGrid(data = df, hue = 'age', height = 7, hue_order=['75+ years', '55-74 years', '35-54 years', '25-34 years', '15-24 years','5-14 years'])
g.map(sb.lineplot, 'year','suicides/100k pop')
g.set(xscale = 'log')
x_ticks = [1985, 1990, 1995, 2000, 2005, 2010, 2015]
g.set(xticks = x_ticks, xticklabels = x_ticks)
plt.title('Global Suicide Rates Trend Over The Years (By Age)', fontsize=16)
g.set(xlabel='Year', ylabel='Suicides per 100k')
g.add_legend();
!jupyter nbconvert Data_Visualization.ipynb --to slides --post serve --template output_toggle